A Comparison of Reinforcement Learning Frameworks for Software Testing Tasks
Software testing activities scrutinize the artifacts and the behavior of a
software product to find possible defects and ensure that the product meets its
expected requirements. Recently, Deep Reinforcement Learning (DRL) has been
successfully employed in complex testing tasks such as game testing, regression
testing, and test case prioritization to automate the process and provide
continuous adaptation. Practitioners can employ DRL by implementing a DRL
algorithm from scratch or by using a DRL framework. DRL frameworks offer
well-maintained implementations of state-of-the-art DRL algorithms to facilitate and
speed up the development of DRL applications. Developers have widely used these
frameworks to solve problems in various domains including software testing.
However, to the best of our knowledge, there is no study that empirically
evaluates the effectiveness and performance of implemented algorithms in DRL
frameworks. Moreover, the literature lacks guidelines that
would help practitioners choose one DRL framework over another. In this paper,
we empirically investigate the applications of carefully selected DRL
algorithms on two important software testing tasks: test case prioritization in
the context of Continuous Integration (CI) and game testing. For the game
testing task, we conduct experiments on a simple game and use DRL algorithms to
explore the game to detect bugs. Results show that some of the selected DRL
frameworks such as Tensorforce outperform recent approaches in the literature.
To prioritize test cases, we run experiments on a CI environment where DRL
algorithms from different frameworks are used to rank the test cases. Our
results show that the performance difference between implemented algorithms in
some cases is considerable, motivating further investigation.
Comment: Accepted for publication at EMSE (Empirical Software Engineering
journal) 202
Faults in Deep Reinforcement Learning Programs: A Taxonomy and A Detection Approach
A growing demand is witnessed in both industry and academia for employing
Deep Learning (DL) in various domains to solve real-world problems. Deep
Reinforcement Learning (DRL) is the application of DL in the domain of
Reinforcement Learning (RL). Like any software system, DRL applications can
fail because of faults in their programs. In this paper, we present the first
attempt to categorize faults occurring in DRL programs. We manually analyzed
761 artifacts of DRL programs (from Stack Overflow posts and GitHub issues)
developed using well-known DRL frameworks (OpenAI Gym, Dopamine, Keras-rl,
Tensorforce) and identified faults reported by developers/users. We labeled and
taxonomized the identified faults through several rounds of discussions. The
resulting taxonomy is validated using an online survey with 19
developers/researchers. To allow for the automatic detection of faults in DRL
programs, we have defined a meta-model of DRL programs and developed DRLinter,
a model-based fault detection approach that leverages static analysis and graph
transformations. The execution flow of DRLinter consists in parsing a DRL
program to generate a model conforming to our meta-model and applying detection
rules on the model to identify fault occurrences. The effectiveness of
DRLinter is evaluated using 15 synthetic DRL programs in which we injected
faults observed in the analyzed artifacts of the taxonomy. The results show
that DRLinter can successfully detect faults in all synthetic faulty programs.
Automatic Fault Detection for Deep Learning Programs Using Graph Transformations
Nowadays, we are witnessing an increasing demand in both industry and
academia for exploiting Deep Learning (DL) to solve complex real-world
problems. A DL program encodes the network structure of a desirable DL model
and the process by which the model learns from the training dataset. Like any
software, a DL program can be faulty, which implies substantial challenges of
software quality assurance, especially in safety-critical domains. It is
therefore crucial to equip DL development teams with efficient fault detection
techniques and tools. In this paper, we propose NeuraLint, a model-based fault
detection approach for DL programs, using meta-modelling and graph
transformations. First, we design a meta-model for DL programs that includes
their base skeleton and fundamental properties. Then, we construct a
graph-based verification process that covers 23 rules defined on top of the
meta-model and implemented as graph transformations to detect faults and design
inefficiencies in the generated models (i.e., instances of the meta-model).
First, the proposed approach is evaluated by finding faults and design
inefficiencies in 28 synthesized examples built from common problems reported
in the literature. Then NeuraLint successfully finds 64 faults and design
inefficiencies in 34 real-world DL programs extracted from Stack Overflow posts
and GitHub repositories. The results show that NeuraLint effectively detects
faults and design issues in both synthesized and real-world examples with a
recall of 70.5% and a precision of 100%. Although the proposed meta-model is
designed for feedforward neural networks, it can be extended to support other
neural network architectures such as recurrent neural networks. Researchers can
also expand our set of verification rules to cover more types of issues in DL
programs.
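To make the idea of rule-based checks over a program model concrete, here is a minimal sketch. It is not NeuraLint's implementation: the `Layer` class, the two rules, and the example network are hypothetical, and the rules are plain Python functions over a list of layers rather than graph transformations over a meta-model instance.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Layer:
    """One node of a hypothetical, simplified model of a feedforward DL program."""
    name: str
    kind: str            # "dense" or "activation"
    in_features: int
    out_features: int


def rule_shape_mismatch(layers: List[Layer]) -> List[str]:
    """Flag consecutive layers whose output and input sizes disagree."""
    return [
        f"shape mismatch: {a.name} outputs {a.out_features}, "
        f"{b.name} expects {b.in_features}"
        for a, b in zip(layers, layers[1:])
        if a.out_features != b.in_features
    ]


def rule_missing_activation(layers: List[Layer]) -> List[str]:
    """Flag two dense layers stacked directly, with no activation between them."""
    return [
        f"missing activation between {a.name} and {b.name}"
        for a, b in zip(layers, layers[1:])
        if a.kind == "dense" and b.kind == "dense"
    ]


# Illustrative rule set; a real tool would define many more rules.
RULES: List[Callable[[List[Layer]], List[str]]] = [
    rule_shape_mismatch,
    rule_missing_activation,
]


def check(layers: List[Layer]) -> List[str]:
    """Apply every rule to the model and collect the reported issues."""
    return [issue for rule in RULES for issue in rule(layers)]


# Hypothetical faulty network used only for demonstration.
model = [
    Layer("fc1", "dense", 784, 128),
    Layer("relu1", "activation", 128, 128),
    Layer("fc2", "dense", 64, 32),   # fault: expects 64 inputs, receives 128
    Layer("fc3", "dense", 32, 10),   # design issue: no activation after fc2
]

for issue in check(model):
    print(issue)
```

Keeping each rule as an independent function over the model mirrors the extensibility the abstract mentions: adding a check means appending one more function to `RULES`.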
Effective Test Generation Using Pre-trained Large Language Models and Mutation Testing
One of the critical phases in software development is software testing.
Testing helps with identifying potential bugs and reducing maintenance costs.
The goal of automated test generation tools is to ease the development of tests
by suggesting efficient bug-revealing tests. Recently, researchers have
leveraged Large Language Models (LLMs) of code to generate unit tests. While
the code coverage of generated tests was usually assessed, the literature has
acknowledged that the coverage is weakly correlated with the efficiency of
tests in bug detection. To improve over this limitation, in this paper, we
introduce MuTAP for improving the effectiveness of test cases generated by LLMs
in terms of revealing bugs by leveraging mutation testing. Our goal is achieved
by augmenting prompts with surviving mutants, as those mutants highlight the
limitations of test cases in detecting bugs. MuTAP is capable of generating
effective test cases in the absence of natural language descriptions of the
Programs Under Test (PUTs). We employ different LLMs within MuTAP and evaluate
their performance on different benchmarks. Our results show that our proposed
method is able to detect up to 28% more faulty human-written code snippets.
Among these, 17% remained undetected by both the current state-of-the-art fully
automated test generation tool (i.e., Pynguin) and zero-shot/few-shot learning
approaches on LLMs. Furthermore, MuTAP achieves a Mutation Score (MS) of 93.57%
on synthetic buggy code, outperforming all other approaches in our evaluation.
Our findings suggest that although LLMs can serve as a useful tool to generate
test cases, they require specific post-processing steps to enhance the
effectiveness of the generated test cases which may suffer from syntactic or
functional errors and may be ineffective in detecting certain types of bugs and
testing corner cases of PUTs.
Comment: 16 pages, 3 figures
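The role of surviving mutants can be illustrated with a small sketch. It is not MuTAP itself: the program under test, the three hand-written mutants, and the test suites are hypothetical, and the extra test is added by hand where MuTAP would instead prompt an LLM with the surviving mutant. The mutation score is computed as killed mutants divided by all mutants.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[Tuple[int, int], int]   # ((a, b), expected result)


def put(a: int, b: int) -> int:
    """Hypothetical program under test: return the larger of two integers."""
    return a if a >= b else b


# Hand-written mutants of `put` (real tools generate these automatically).
def mutant_min(a: int, b: int) -> int:
    return a if a <= b else b    # ">=" replaced by "<="


def mutant_return_first(a: int, b: int) -> int:
    return a                     # else-branch removed


def mutant_return_second(a: int, b: int) -> int:
    return b                     # then-branch removed


MUTANTS: List[Callable[[int, int], int]] = [
    mutant_min,
    mutant_return_first,
    mutant_return_second,
]


def is_killed(mutant: Callable[[int, int], int], tests: List[TestCase]) -> bool:
    """A mutant is killed if at least one test observes a wrong result."""
    return any(mutant(*args) != expected for args, expected in tests)


def mutation_score(tests: List[TestCase], mutants) -> Tuple[float, List[str]]:
    """Return (killed / total, names of surviving mutants)."""
    survivors = [m.__name__ for m in mutants if not is_killed(m, tests)]
    return (len(mutants) - len(survivors)) / len(mutants), survivors


# The initial suite passes on `put` but leaves one mutant alive.
initial_tests: List[TestCase] = [((2, 1), 2)]
score, survivors = mutation_score(initial_tests, MUTANTS)
print(score, survivors)

# The survivor shows what is untested (the case a < b), so a test is added for it.
augmented_tests = initial_tests + [((1, 2), 2)]
print(mutation_score(augmented_tests, MUTANTS))
```

A suite can pass on the original program and still miss a whole branch; the surviving mutant names that gap, which is the information fed back into test generation.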
Quality Issues in Machine Learning Software Systems
Context: An increasing demand is observed in various domains to employ
Machine Learning (ML) for solving complex problems. ML models are implemented
as software components and deployed in Machine Learning Software Systems
(MLSSs). Problem: There is a strong need for ensuring the serving quality of
MLSSs. False or poor decisions of such systems can lead to malfunction of other
systems, significant financial losses, or even threats to human life. The
quality assurance of MLSSs is considered a challenging task and currently is a
hot research topic. Objective: This paper aims to investigate the
characteristics of real quality issues in MLSSs from the viewpoint of
practitioners. This empirical study aims to identify a catalog of quality
issues in MLSSs. Method: We conduct a set of interviews with
practitioners/experts, to gather insights about their experience and practices
when dealing with quality issues. We validate the identified quality issues via
a survey with ML practitioners. Results: Based on the content of 37 interviews,
we identified 18 recurring quality issues and 24 strategies to mitigate them.
For each identified issue, we describe the causes and consequences according to
the practitioners' experience. Conclusion: We believe the catalog of issues
developed in this study will allow the community to develop efficient quality
assurance tools for ML models and MLSSs. A replication package of our study is
available on our public GitHub repository.
How to Certify Machine Learning Based Safety-critical Systems? A Systematic Literature Review
Context: Machine Learning (ML) has been at the heart of many innovations over
the past years. However, including it in so-called 'safety-critical' systems
such as automotive or aeronautic has proven to be very challenging, since the
shift in paradigm that ML brings completely changes traditional certification
approaches.
Objective: This paper aims to elucidate challenges related to the
certification of ML-based safety-critical systems, as well as the solutions
that are proposed in the literature to tackle them, answering the question 'How
to Certify Machine Learning Based Safety-critical Systems?'.
Method: We conduct a Systematic Literature Review (SLR) of research papers
published between 2015 and 2020, covering topics related to the certification of
ML systems. In total, we identified 217 papers covering topics considered to be
the main pillars of ML certification: Robustness, Uncertainty, Explainability,
Verification, Safe Reinforcement Learning, and Direct Certification. We
analyzed the main trends and problems of each sub-field and provided summaries
of the papers extracted.
Results: The SLR results highlighted the enthusiasm of the community for this
subject, as well as the lack of diversity in terms of datasets and type of
models. It also emphasized the need to further develop connections between
academia and industries to deepen the domain study. Finally, it also
illustrated the necessity to build connections between the above-mentioned main
pillars, which are for now mainly studied separately.
Conclusion: We highlighted current efforts deployed to enable the
certification of ML-based software systems, and discussed some future research
directions.
Comment: 60 pages (92 pages with references and complements), submitted to a
journal (Automated Software Engineering). Changes: Emphasizing the difference
between the traditional software engineering and ML approaches. Adding Related
Works, Threats to Validity and Complementary Materials. Adding a table listing
paper references for each section/subsection.
Improved reinforcement learning in cooperative multi-agent environments using knowledge transfer
Nowadays, cooperative multi-agent systems are used to learn how to achieve
goals in large-scale dynamic environments. However, learning in these
environments is challenging: from the effect of search space size on learning
time to inefficient cooperation among agents. Moreover, reinforcement learning
algorithms may suffer from long convergence times in such environments. In
this paper, a communication framework is introduced. In the proposed
communication framework, agents learn to cooperate effectively, and a new
state calculation method considerably reduces the size of the state space.
Furthermore, a knowledge-transferring algorithm is
presented to share the gained experiences among the different agents, together with
an effective knowledge-fusing mechanism that fuses the knowledge learnt
from the agents' own experiences with the knowledge received from other
team members. Finally, the simulation results are provided to indicate the
efficacy of the proposed method in the complex learning task. We have evaluated
our approach on the shepherding problem, and the results show that the
knowledge-transferring mechanism accelerates the learning process and that
the size of the state space declines when similar states are generated based on
the state abstraction concept.
Comment: Accepted for publication by The Journal of Supercomputing